import numpy as np
import matplotlib.pyplot as plt # To visualize
import pandas as pd # To read data
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
from IPython.core.display import HTML
import seaborn as sns
NHGIS Data
First, we load the required libraries and data. I am using Python for this because I had trouble getting Stata to work well.
data1 = pd.read_csv('nhgis/3.csv', encoding = "ISO-8859-1")
data2 = pd.read_csv('nhgis/4.csv', encoding = "ISO-8859-1")
Next, we merge the two datasets, create a statistic representing the fraction of the population with a B.A. or higher, and isolate the columns needed for this assignment in the DataFrame.
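Before running a merge like the one below, it can be worth checking that the key column matches cleanly across the two files. A minimal sketch on toy frames (the column names here mimic the NHGIS codes but the values are made up):

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two NHGIS extracts.
left = pd.DataFrame({'GISJOIN': ['G100', 'G200'], 'ALWGE001': [1000, 2000]})
right = pd.DataFrame({'GISJOIN': ['G100', 'G200', 'G300'], 'ALX5E001': [25000, 30000, 18000]})

# indicator=True adds a _merge column flagging rows found in only one file.
check = pd.merge(left, right, on='GISJOIN', how='outer', indicator=True)
print(check.loc[check['_merge'] != 'both', 'GISJOIN'].tolist())  # ['G300']
```

Any key listed there would be silently dropped by the default inner merge.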
data = pd.merge(data1, data2, on='GISJOIN')
data['PctBA'] = (data['ALWGE022'] + data['ALWGE023'] + data['ALWGE024'] + data['ALWGE025']) / data['ALWGE001']
data = data[['GISJOIN', 'CBSA_x', 'CBSAA_x', 'ALWGE001', 'ALX5E001', 'AMEME001', 'PctBA']]
data.columns = ['GISJOIN', 'CBSA_Name', 'CBSA_Code', 'Population', 'Income_Per_Capita', 'Gini', 'Percent_with_BA']
Question G:
We decided not to restrict by size, since the data tables we downloaded were already organized by CBSA, and the smallest CBSA in the dataset already had over 6,000 people, which seemed large enough for reliable estimates.
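Had we chosen to restrict by size, a simple filter would have done it. A sketch on toy data; the 50,000 cutoff and the rows here are purely illustrative:

```python
import pandas as pd

# Illustrative frame with the Population column defined above.
df = pd.DataFrame({'CBSA_Name': ['A', 'B', 'C'],
                   'Population': [6000, 60000, 250000]})

MIN_POP = 50000  # arbitrary example threshold, not one used in the assignment
restricted = df[df['Population'] >= MIN_POP]
print(len(restricted))  # 2
```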
Question H:
Just from sorting the values, the least educated CBSAs appear to have much lower populations than the most educated CBSAs, as well as lower average per capita incomes. The Gini coefficient, however, seems somewhat lower in the less educated CBSAs.
data.sort_values('Percent_with_BA')
| | GISJOIN | CBSA_Name | CBSA_Code | Population | Income_Per_Capita | Gini | Percent_with_BA |
|---|---|---|---|---|---|---|---|
| 658 | G37770 | Pearsall, TX Micro Area; San Antonio-New Braun... | 37770 | 12449 | 19256 | 0.5087 | 0.072857 |
| 173 | G17500 | Clewiston, FL Micro Area; Cape Coral-Fort Myer... | 17500 | 25671 | 19167 | 0.4681 | 0.082973 |
| 704 | G39700 | Raymondville, TX Micro Area; Brownsville-Harli... | 39700 | 13296 | 14888 | 0.4650 | 0.088598 |
| 82 | G13500 | Bennettsville, SC Micro Area | 13500 | 19108 | 17456 | 0.4782 | 0.088916 |
| 828 | G44900 | Summerville, GA Micro Area; Chattanooga-Clevel... | 44900 | 17107 | 18715 | 0.4476 | 0.089320 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 403 | G27060 | Ithaca, NY Metro Area | 27060 | 59774 | 33075 | 0.4960 | 0.535183 |
| 201 | G18700 | Corvallis, OR Metro Area | 18700 | 55359 | 33817 | 0.4801 | 0.541484 |
| 31 | G11460 | Ann Arbor, MI Metro Area | 11460 | 229268 | 41399 | 0.4848 | 0.559271 |
| 105 | G14500 | Boulder, CO Metro Area | 14500 | 212755 | 46826 | 0.4848 | 0.620766 |
| 497 | G31060 | Los Alamos, NM Micro Area; Albuquerque-Santa F... | 31060 | 13260 | 60746 | 0.3909 | 0.674057 |
938 rows × 7 columns
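One way to make the eyeballed pattern concrete is to compare group means at the two ends of the education distribution. A sketch on synthetic rows mimicking the sorted table above (the values are invented, not from the NHGIS data):

```python
import pandas as pd

# Synthetic rows standing in for the merged NHGIS frame.
df = pd.DataFrame({
    'Population': [10000, 12000, 500000, 800000],
    'Income_Per_Capita': [18000, 19000, 45000, 52000],
    'Percent_with_BA': [0.08, 0.09, 0.50, 0.55],
})

ranked = df.sort_values('Percent_with_BA')
bottom, top = ranked.head(2), ranked.tail(2)

# Least-educated rows have lower mean population and income in this toy data.
print(bottom['Income_Per_Capita'].mean() < top['Income_Per_Capita'].mean())  # True
```

On the real data, `head`/`tail` with a larger n (e.g. deciles) would give the same comparison.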
Regression 1
This regression is a simple linear regression of income per capita on total population (city size).
Income_Pop = sm.OLS(data['Income_Per_Capita'], sm.add_constant(data['Population'])).fit()
print(Income_Pop.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Income_Per_Capita R-squared: 0.097
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 100.3
Date: Mon, 11 Oct 2021 Prob (F-statistic): 1.70e-22
Time: 14:06:38 Log-Likelihood: -9467.9
No. Observations: 938 AIC: 1.894e+04
Df Residuals: 936 BIC: 1.895e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.752e+04 199.705 137.821 0.000 2.71e+04 2.79e+04
Population 0.0026 0.000 10.014 0.000 0.002 0.003
==============================================================================
Omnibus: 118.872 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 408.315
Skew: 0.587 Prob(JB): 2.17e-89
Kurtosis: 6.012 Cond. No. 8.16e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
We observe a statistically significant positive relationship between income per capita and city size.
Next we run a regression of Gini on Population:
Pop_Gini = sm.OLS(data['Gini'], sm.add_constant(data['Population'])).fit()
print(Pop_Gini.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.023
Model: OLS Adj. R-squared: 0.022
Method: Least Squares F-statistic: 21.79
Date: Mon, 11 Oct 2021 Prob (F-statistic): 3.49e-06
Time: 14:06:39 Log-Likelihood: 1956.8
No. Observations: 938 AIC: -3910.
Df Residuals: 936 BIC: -3900.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.4521 0.001 441.058 0.000 0.450 0.454
Population 6.123e-09 1.31e-09 4.667 0.000 3.55e-09 8.7e-09
==============================================================================
Omnibus: 53.772 Durbin-Watson: 1.909
Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.513
Skew: 0.510 Prob(JB): 2.96e-16
Kurtosis: 3.888 Cond. No. 8.16e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
We get a somewhat confusing result: the coefficient is extremely small, yet its p-value is highly significant. This is a matter of units rather than a weak relationship. Gini coefficients are bounded roughly between 0 and 1, while populations run into the hundreds of thousands, so the per-person slope is necessarily tiny. If we reverse it and use population as the dependent variable, we get the following:
Pop_Gini_Reversed = sm.OLS(data['Population'], sm.add_constant(data['Gini'])).fit()
print(Pop_Gini_Reversed.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Population R-squared: 0.023
Model: OLS Adj. R-squared: 0.022
Method: Least Squares F-statistic: 21.79
Date: Mon, 11 Oct 2021 Prob (F-statistic): 3.49e-06
Time: 14:06:40 Log-Likelihood: -14008.
No. Observations: 938 AIC: 2.802e+04
Df Residuals: 936 BIC: 2.803e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.461e+06 3.62e+05 -4.038 0.000 -2.17e+06 -7.51e+05
Gini 3.715e+06 7.96e+05 4.667 0.000 2.15e+06 5.28e+06
==============================================================================
Omnibus: 1493.733 Durbin-Watson: 1.978
Prob(Omnibus): 0.000 Jarque-Bera (JB): 684543.407
Skew: 9.648 Prob(JB): 0.00
Kurtosis: 133.930 Cond. No. 39.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Finally, we reach the regression of Gini on income per capita. There is a significant negative relationship between the two variables.
Income_Gini = sm.OLS(data['Gini'], sm.add_constant(data['Income_Per_Capita'])).fit()
print(Income_Gini.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gini R-squared: 0.022
Model: OLS Adj. R-squared: 0.021
Method: Least Squares F-statistic: 20.65
Date: Mon, 11 Oct 2021 Prob (F-statistic): 6.25e-06
Time: 14:06:40 Log-Likelihood: 1956.3
No. Observations: 938 AIC: -3909.
Df Residuals: 936 BIC: -3899.
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 0.4738 0.005 103.256 0.000 0.465 0.483
Income_Per_Capita -7.249e-07 1.6e-07 -4.544 0.000 -1.04e-06 -4.12e-07
==============================================================================
Omnibus: 29.614 Durbin-Watson: 1.892
Prob(Omnibus): 0.000 Jarque-Bera (JB): 34.674
Skew: 0.373 Prob(JB): 2.96e-08
Kurtosis: 3.575 Cond. No. 1.34e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.34e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
stargazer = Stargazer([Income_Pop, Pop_Gini, Income_Gini])
stargazer.significant_digits(4)
stargazer.custom_columns(['Income on Population', 'Gini on Population', 'Gini on Income'], [1, 1, 1])
HTML(stargazer.render_html())
| | Income on Population | Gini on Population | Gini on Income |
|---|---|---|---|
| | (1) | (2) | (3) |
| Income_Per_Capita | | | -0.0000*** |
| | | | (0.0000) |
| Population | 0.0026*** | 0.0000*** | |
| | (0.0003) | (0.0000) | |
| const | 27523.4610*** | 0.4521*** | 0.4738*** |
| | (199.7046) | (0.0010) | (0.0046) |
| Observations | 938 | 938 | 938 |
| R2 | 0.0968 | 0.0227 | 0.0216 |
| Adjusted R2 | 0.0958 | 0.0217 | 0.0205 |
| Residual Std. Error | 5859.5028 (df=936) | 0.0301 (df=936) | 0.0301 (df=936) |
| F Statistic | 100.2890*** (df=1; 936) | 21.7852*** (df=1; 936) | 20.6460*** (df=1; 936) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
Above is the table comparing all of the regression results side by side.